This quickstart guide explains how to match two tables using Magellan. Our goal is to come up with a workflow to match the DBLP and ACM datasets. Specifically, we want to achieve a precision greater than 95% while getting recall as high as possible. The datasets contain information about conference papers published in top database conferences.
First, we need to import the Magellan package as follows:
In [1]:
# Import libraries
import py_entitymatching as em
import pandas as pd
import os, sys
In [2]:
print('python version: ' + sys.version )
print('pandas version: ' + pd.__version__ )
print('magellan version: ' + em.__version__ )
Matching two tables typically consists of the following three steps: (1) loading the input tables, (2) blocking the input tables to get a candidate set of tuple pairs, and (3) matching the tuple pairs in the candidate set.
We begin by loading the input tables. For the purpose of this guide, we use the datasets that are included with the package.
In [3]:
#dblp_dataset_path = os.sep.join(['DBLP_ACM', 'DBLP_cleaned.csv'])
#acm_dataset_path = os.sep.join(['DBLP_ACM', 'ACM_cleaned.csv'])
datasets_dir = em.get_install_path() + os.sep + 'datasets'
dblp_dataset_path = datasets_dir + os.sep + 'DBLP.csv'
acm_dataset_path = datasets_dir + os.sep + 'ACM.csv'
In [4]:
# Load csv files as dataframes and set the key attribute in the dataframe
A = em.read_csv_metadata(dblp_dataset_path, key='id')
B = em.read_csv_metadata(acm_dataset_path, key='id')
In [5]:
print('Number of tuples in A: ' + str(len(A)))
print('Number of tuples in B: ' + str(len(B)))
print('Number of tuples in A X B (i.e., the Cartesian product): ' + str(len(A)*len(B)))
In [6]:
A.head(2)
Out[6]:
In [7]:
B.head(2)
Out[7]:
In [8]:
# Display the key attributes of table A and B.
em.get_key(A), em.get_key(B)
Out[8]:
In [9]:
# If the tables are large, we can down sample them like this
A1, B1 = em.down_sample(A, B, 500, 1)
# But for this demo, we will use the entire tables A and B
Before we do the matching, we would like to remove the obviously non-matching tuple pairs from the input tables. This would reduce the number of tuple pairs considered for matching.
Magellan provides four different blockers: (1) attribute equivalence, (2) overlap, (3) rule-based, and (4) black-box. Refer to [api reference] for more details. The user can mix and match these blockers to form a blocking sequence applied to the input tables.
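For instance, one blocker can be applied to the output of another blocker (using the blocker's block_candset method) to form a tighter blocking sequence. The snippet below is only an illustrative sketch of such chaining (with illustrative variable names); it is not one of the cells in this guide, which instead unions the outputs of two blockers later on.

# Illustrative sketch: chain an overlap blocker after an attribute equivalence blocker
ab_sketch = em.AttrEquivalenceBlocker()
ob_sketch = em.OverlapBlocker()
tmp = ab_sketch.block_tables(A, B, 'year', 'year',
                             l_output_attrs=['title', 'authors', 'year'],
                             r_output_attrs=['title', 'authors', 'year'])
# Keep only the pairs in tmp whose 'authors' values share at least one word
chained = ob_sketch.block_candset(tmp, 'authors', 'authors')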
For the matching problem at hand, we know that two conference papers published in different years cannot match. So, we decide to apply an attribute equivalence blocker on the 'year' attribute.
In [10]:
# Plan
# A, B ------ attribute equivalence [year] -----> C1
In [11]:
# Create attribute equivalence blocker
ab = em.AttrEquivalenceBlocker()
# Block tables using the 'year' attribute: if the years are equal, include the pair in the candidate set
C1 = ab.block_tables(A, B, 'year', 'year',
                     l_output_attrs=['title', 'authors', 'year'],
                     r_output_attrs=['title', 'authors', 'year'])
In [12]:
# Check the number of rows in C1
len(C1)
Out[12]:
In [13]:
# Display first two rows from C1
C1.head(2)
Out[13]:
The number of tuple pairs considered for matching is reduced to 601,284 (from 6,001,104), but we want to make sure that the blocker did not drop any potential matches. We can debug the blocker output in Magellan as follows:
In [14]:
# Debug blocker output
dbg = em.debug_blocker(C1, A, B, output_size=200)
In [15]:
# Display first few tuple pairs from the debug_blocker's output
dbg.head()
Out[15]:
From the debug blocker's output, we observe that the current blocker drops quite a few potential matches. We need to update the blocking sequence to avoid dropping these potential matches.
For this dataset, we know that for two conference papers to match, their author names must overlap. We can use an overlap blocker for this purpose. Finally, we union the outputs of the attribute equivalence blocker and the overlap blocker to get a consolidated candidate set.
In [16]:
# Updated blocking sequence
# A, B ------ attribute equivalence [year] -----> C1--
# |----> C
# A, B ------ overlap blocker [authors] --------> C2--
In [17]:
# Create an overlap blocker
ob = em.OverlapBlocker()
# Apply overlap blocker on 'authors' attribute
C2 = ob.block_tables(A, B, 'authors', 'authors',
                     l_output_attrs=['title', 'authors', 'year'],
                     r_output_attrs=['title', 'authors', 'year'])
In [18]:
# Check the number of rows in C2
len(C2)
Out[18]:
In [19]:
# Display first two rows from C2
C2.head(2)
Out[19]:
In [20]:
# Combine blocker outputs
C = em.combine_blocker_outputs_via_union([C1, C2])
In [21]:
# Check the number of rows in the consolidated candidate set.
len(C)
Out[21]:
In [22]:
# Debug again
dbg = em.debug_blocker(C, A, B)
In [23]:
# Display first few rows from the debugger output
dbg.head(3)
Out[23]:
We observe that the current blocking sequence does not drop obvious potential matches, so we can now proceed with the matching step. A subtle point to note here is that debugging the blocker output effectively provides a stopping criterion for modifying the blocking sequence.
In this step, we match the tuple pairs in the candidate set. Specifically, we use a learning-based method for matching.
This typically involves the following five steps: (1) sampling the candidate set, (2) labeling the sampled tuple pairs, (3) splitting the labeled data into development and evaluation sets, (4) selecting the best learning-based matcher using the development set, and (5) evaluating the selected matcher on the evaluation set.
First, we randomly sample 450 tuple pairs for labeling purposes.
In [24]:
# Sample candidate set
S = em.sample_table(C, 450)
Next, we label the sampled candidate set. Specifically, we enter 1 for a match and 0 for a non-match.
In [25]:
# Label S and specify the attribute name for the label column
L = em.label_table(S, 'gold')
For the purposes of this guide, we will load in a pre-labeled dataset (of 415 tuple pairs) included in this package.
In [26]:
# Load the pre-labeled data
labeled_dataset_path = datasets_dir + os.sep + 'dblp_acm_demo_labels.csv'
L = em.read_csv_metadata(labeled_dataset_path,
                         key='_id',
                         ltable=A, rtable=B,
                         fk_ltable='ltable.id', fk_rtable='rtable.id')
# Display the number of rows in the labeled data set
len(L)
Out[26]:
In this step, we split the labeled data into two sets: development and evaluation. Specifically, the development set is used to come up with the best learning-based matcher, and the evaluation set is used to evaluate the selected matcher on unseen data.
In [27]:
# Split the labeled data into development and evaluation set
development_evaluation = em.split_train_test(L, train_proportion=0.7)
development = development_evaluation['train']
evaluation = development_evaluation['test']
Selecting the best learning-based matcher typically involves the following steps: creating a set of learning-based matchers, generating features, converting the development set into feature vectors, selecting the best matcher using cross-validation, and debugging the matcher (repeating the previous steps if necessary).
First, we need to create a set of learning-based matchers. The following matchers are supported in Magellan: (1) decision tree, (2) random forest, (3) Naive Bayes, (4) SVM, (5) logistic regression, and (6) linear regression.
In [28]:
# Create a set of ML-matchers
dt = em.DTMatcher(name='DecisionTree')
svm = em.SVMMatcher(name='SVM')
rf = em.RFMatcher(name='RF')
nb = em.NBMatcher(name='NB')
lg = em.LogRegMatcher(name='LogReg')
ln = em.LinRegMatcher(name='LinReg')
In [29]:
# Generate features
feature_table = em.get_features_for_matching(A, B, validate_inferred_attr_types=False)
In [30]:
# List the names of the features generated
feature_table['feature_name']
Out[30]:
In [31]:
# Select the year related features
feature_subset_iter1 = feature_table[10:14]
In [32]:
# List the names of the features selected
feature_subset_iter1['feature_name']
Out[32]:
In [33]:
# Extract feature vectors
feature_vectors_dev = em.extract_feature_vecs(development,
                                              feature_table=feature_subset_iter1,
                                              attrs_after='gold')
In [34]:
# Display first few rows
feature_vectors_dev.head(3)
Out[34]:
Next, we may have to impute the feature vectors, as they might contain missing values. First, let us check whether the extracted feature vectors contain any missing values.
In [35]:
# Check if the feature vectors contain missing values
# A return value of True means that there are missing values
pd.isnull(feature_vectors_dev).values.any()
Out[35]:
We observe that the extracted feature vectors contain missing values. We have to impute these missing values for the learning-based matchers to fit their models correctly. For the purposes of this guide, we impute the missing values in a column with the mean of the values in that column.
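Before imputing, it can also help to see which feature columns actually contain missing values. The following one-liner is an illustrative pandas check, not one of the cells in the original workflow:

# Count the missing values in each feature column
feature_vectors_dev.isnull().sum()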
In [36]:
# Impute feature vectors with the mean of the column values.
feature_vectors_dev = em.impute_table(feature_vectors_dev,
                                      exclude_attrs=['_id', 'ltable.id', 'rtable.id', 'gold'],
                                      strategy='mean')
In [37]:
# Select the best ML matcher using CV
result = em.select_matcher([dt, rf, svm, nb, lg, ln], table=feature_vectors_dev,
                           exclude_attrs=['_id', 'ltable.id', 'rtable.id', 'gold'],
                           k=10,
                           target_attr='gold',
                           metric_to_select_matcher='precision',
                           random_state=0)
In [38]:
result['cv_stats']
Out[38]:
In [39]:
result['drill_down_cv_stats']['precision']
Out[39]:
In [40]:
# Split feature vectors into train and test
train_test = em.split_train_test(feature_vectors_dev, train_proportion=0.5)
train = train_test['train']
test = train_test['test']
In [41]:
# Debug the random forest matcher using the GUI
em.vis_debug_rf(rf, train, test,
                exclude_attrs=['_id', 'ltable.id', 'rtable.id', 'gold'],
                target_attr='gold')
From the GUI, we observe that using only 'year'-related features results in a lot of false positives. So we decide to use all the features in the feature table (which includes author-, title-, and year-related features).
In [42]:
# Select all features from the feature table
feature_subset_iter2 = feature_table
Now, we repeat extracting the feature vectors (this time with the updated feature table), imputing the table, and selecting the best matcher using cross-validation.
In [43]:
# Extract feature vectors using the updated feature table
feature_vectors_dev = em.extract_feature_vecs(development, feature_table=feature_subset_iter2, attrs_after='gold')
In [44]:
# Check if imputation is required
pd.isnull(feature_vectors_dev).values.any()
Out[44]:
In [45]:
# Impute feature vectors
feature_vectors_dev = em.impute_table(feature_vectors_dev,
                                      exclude_attrs=['_id', 'ltable.id', 'rtable.id', 'gold'],
                                      strategy='mean')
In [46]:
# Apply cross-validation to see whether there is a better matcher
result = em.select_matcher([dt, rf, svm, nb, lg, ln], table=feature_vectors_dev,
                           exclude_attrs=['_id', 'ltable.id', 'rtable.id', 'gold'],
                           k=10,
                           target_attr='gold',
                           metric_to_select_matcher='precision',
                           random_state=0)
In [47]:
result['cv_stats']
Out[47]:
In [48]:
result['drill_down_cv_stats']['precision']
Out[48]:
We now observe that the best matcher achieves the expected precision, so we can proceed to evaluate it on the unseen data (the evaluation set).
Evaluating the matching output on the evaluation set typically involves the following four steps: (1) converting the evaluation set into feature vectors, (2) training a matcher using the feature vectors of the development set, (3) predicting on the evaluation set using the trained matcher, and (4) evaluating the predictions.
As before, we extract the feature vectors (using the updated feature table and the evaluation set) and impute them (if necessary).
In [49]:
# Extract feature vectors for the evaluation set
feature_vectors_eval = em.extract_feature_vecs(evaluation,
                                               feature_table=feature_subset_iter2,
                                               attrs_after='gold')
In [50]:
# Check if the feature vectors contain missing values
# A return value of True means that there are missing values
pd.isnull(feature_vectors_eval).values.any()
Out[50]:
In [51]:
# Impute feature vectors
feature_vectors_eval = em.impute_table(feature_vectors_eval,
                                       exclude_attrs=['_id', 'ltable.id', 'rtable.id', 'gold'],
                                       strategy='mean')
In [52]:
# Train using feature vectors from the development set
rf.fit(table=feature_vectors_dev,
       exclude_attrs=['_id', 'ltable.id', 'rtable.id', 'gold'],
       target_attr='gold')
In [53]:
# Predict matches on the evaluation set
predictions = rf.predict(table=feature_vectors_eval,
                         exclude_attrs=['_id', 'ltable.id', 'rtable.id', 'gold'],
                         append=True,
                         target_attr='predicted',
                         inplace=False)
In [54]:
# Evaluate the result
eval_result = em.eval_matches(predictions, 'gold', 'predicted')
em.print_eval_summary(eval_result)
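Although this guide stops at evaluating the selected matcher on the evaluation set, a natural next step is to apply the trained matcher to the entire candidate set C to obtain the predicted matches. The sketch below reuses the variables and API calls from the steps above; it is an illustrative sketch rather than one of the notebook cells.

# Sketch: predict matches over the full candidate set C
# (the id column names follow the ones used above)
feature_vectors_C = em.extract_feature_vecs(C, feature_table=feature_subset_iter2)
feature_vectors_C = em.impute_table(feature_vectors_C,
                                    exclude_attrs=['_id', 'ltable.id', 'rtable.id'],
                                    strategy='mean')
predicted_C = rf.predict(table=feature_vectors_C,
                         exclude_attrs=['_id', 'ltable.id', 'rtable.id'],
                         append=True,
                         target_attr='predicted',
                         inplace=False)
# Keep only the tuple pairs predicted as matches
matches = predicted_C[predicted_C['predicted'] == 1]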